Small (breaking) refactor, refactor tests, add MIT license#104
Merged
svwingerden merged 17 commits intomainfrom Jun 25, 2025
Merged
Small (breaking) refactor, refactor tests, add MIT license#104svwingerden merged 17 commits intomainfrom
svwingerden merged 17 commits intomainfrom
Conversation
Init loss now logged even if online=False
…quired refactor) (#427) * WIP * FIX gradscaler for amp * RM some prints * FIX zero_grad in SGMCMC * FIX context management for autocast * WIP example notebook * FIX gradientscaler only if use_amp, add todo * FIX use bfloat16 instead, remove gradscaler * WIP update notebook * FIX autocast also on TPU? also change dtype of loaded model * RM old code * WIP cuda fixes * RM example notebook * RFC USE_XLA to BF16 and FP16 * RFC BF16 and FP16 * FIX Optional expects an argument * RM profiling code * RFC optimizer_kwargs to sampling_method_kwargs * WIP priot kwargs refactor * WIP TrainingArguments in CustomCheckpointCallback * FIX CustomCheckpointCallback refactor * FIX SGMCMC vs SGLD incorrect distance comparison (weight decay currently broken) * FIX SGMCMC w/ wd * UNDO bfloat16 changes * FIX pythia model now not automatically loaded to cuda >:( * ADD working pythia6.9B yaml * FIX pythia test * FIX BF16 now default * CHG docs * CHG yaml files * FIX prior_kwartgs in optimizers and samplers * FIX tests * RFC Checkpointer now uses model_args * FIX loss tests * CHG devinterp typing changes * RFC BF16 now default * FIX dataclass dtype should be str * CHG prep for the big run * CHG n_ctx to 512 * FIX test formatting, dtype getattr magic * FIX NUM_GPUS for single-gpu machines * undo to 512 * UNDO pythia checkpoint is not 1024 * FIX checkpoint test * FIX checkpointer test, loss test precision * WIP refactor BF16 type, getattr back in sample loop * FIX typing * FIX yaml requires strings for unknown types? * FIX sampler dtype change * FIX typing of dtype arg * FIX memory usage handling * FIX don't copy and move model is using SPMD * FIX Jesse's minor requests * FIX formatting (sorry Jesse) * FIX is_dataclass in hparams
* gitignore updated * n_ctx sent to tokenize_dataset * hooked up n_ctx to every instance of get_datasets * last get_datasets * printing config' * using pretty printer * printing task metadata * rm nctx req * pretty printing metadata * added update n_ctx function * logging context len * right quotation * context length up top * change applied everywhere * added test * black * added n_ctx to test yaml * equals * Move update_n_ctx call before logger configuration * Remove unused parameter `n_ctx` from function * Add newline in _get_tokenize_kwargs function * fixed quantize action typo * rm commented out steps * black * suprious commit * tokenizer max length set * removed n_ctx setting * Fix logger comment formatting in llc.py * installed tpu thing * Remove 'hub/' from .gitignore * FIX bug * printing max len * FIX test * FIX loss tests now use pythia temporarily, update snapshots to match this change * FIX tokenizer tests * FIX formatting * FIX formatting even more * FIX merge issue, test * FIX workflow file now shows which tests fail * WIP CI/CD * CHG env vars for loss tests --------- Co-authored-by: Jesse Hoogland <jessequinten@gmail.com> Co-authored-by: svwingerden <stanvanwingerden@gmail.com>
* RFC grad_accum_steps -> gradient_accumulation_steps * FIX formatting * nonsense commit to test CI/CD * [Aether] Fix init loss (#483) * Account for gradient accumulation steps when computing init loss * Linting --------- Co-authored-by: Claude <claude@anthropic.com> Co-authored-by: George <gwang24@gmail.com>
…ter (#443) * prototype pyproject usage and Makefile changes * Commit ready for PR * Use lock instead of sync * Upgraded lock file and fixed newline formatting * moved dev dependencies to dev dependency groups and updated submodule_test.yml for uv * FIX formatting * Improve UV installation and detection in Makefile * Add pre-commit install steps to Makefile * Fixing pytest execution (run aether tests from root dir, not shared/aether) --------- Co-authored-by: Stan van Wingerden <stanvanwingerden@gmail.com>
* Add changes (dirty branch, don't merge to main) * Add dirty changes * Add dirty changes * Stashing changes * Stashing changes * Add process group cleanup + dataloader distributed sampling * Add all-reduce on metrics * Fix all-reduce device issue * Dirty commit (not working) - halfway fix for the checkpoint loading problem * Super dirty commit - blocked on checkpoint loading problem * Fix a bunch of issues. Debug statements remain * First clean-ish commit - functioning FSDP for a fixed model set in llc.py. Dataloader splitting is still problematic - there's replication of data between GPUs. * Formatting with black * Added profiling to examine FSDP memory imbalance * Fix sampler batch size * Fixes for the layernorm problem and the dataloader interleaving problem. * Move destroy_process_group after action completion + fix OMP worker overload bug * Add gradient checkpointing * Add FSDP/non-FSDP comparison test * Bugfix for non-FSDP gpu smampling * Add snapshot tests * Fix fsdp pytest * Black formatting * Add seeding + rmsprop tests for FSDP * Simplify fsdp test + add syrupy snapshot * Format tests with Black * ADD TODOs for Stan * FIX SPMD multi chain, address some old will comments * update snapshot * FIX snapshot for fsdp * FIX tests I broke accidentally * FIX snapshot test full precision * ok now the tests should really pass w/ full precision --------- Co-authored-by: Stan van Wingerden <stanvanwingerden@gmail.com>
[CI/CD] Add GPU test action
…tests-shared-fixtures [Aether] Einar/eng 178 refactor tests shared fixtures
* ADD TPU test * CHG TPU config * ADD misc * Readd ssh key * ADD smoke test * ADD import torch_xla * FIX tpu dependencies * FIX dep for gpu? * Readd torch_xla explicitly * Actually fix * Maybe this is better? * Ready now? * Don't restrict tests * Shrink tpu test * ADD server port for gpu * FIX test? * FIX GPU ssh connectivity test * ADD TPU_TYPE * FIX TPU_TYPE passing * CHG secrets. -> env. * CHG env. -> vars. * UPD test * ADD cln up comparison * UPD test? * More relaxed tests * Try second round of tests on tpu * FIX aether tests * UPDATE sync * ADD skips if missing file * FIX tests? * FIX tests for real * Consolidate tests into one * Final fix? * Skip breaking test * Skip breaking test * Skip breaking test * FIX gpu? * Skip TPU test if WIP * NVM * Actually skip * Skip TPU if WIP * CHG which test is being run * Rerun * Comment out failing test * Trust o3 * Wrong indenting * RM find-links * Skip 31m * Reward-hacked my way out of this * FIX checkpoint naming * Update tpu_test.yml * Update gpu_test.yml * Update tpu_test.yml * FIX tests for CI/CD * CHG verbose now false in test_tpus, check for SPMD in gpu tests * FIX formatting * FIX failing data tests (and bypass one) --------- Co-authored-by: svwingerden <stanvanwingerden@gmail.com>
* make * imports explicit * move to snapshots for sampler_accuracy_test * try up validation chains & draws * Remove normal-crossing tests, as they are pretty inconsistent and don't converge well. * Refactor rrr_test, including caching model training and using snapshots * Format with black * Run snapshot test first every time, for rng seeding consistency. * fix mismatch between snapshot/non-snapshot * Comment out sampler_ordinality test for now, to debug CI faster * Refactor rllc_test into fast/slow paradigm * Seed the dataset generator, and re-snapshot * Make sure dataloaders are deterministic in conftest. * Fix rllc_test failures by commenting out powers = [1, 1] test. * Refactor ordinality test into snapshot format. * Reformat ordinality test. * Format conftest * Remove burnin steps comment * Add architecture check, and skip some tests if not x86_64.
* Add fix for pyright not finding installed packages. * Fix relative path in timaeus_cli pyproject.toml --------- Co-authored-by: Stan van Wingerden <stanvanwingerden@gmail.com>
* WIP use self-hosted runners for speedup? * WIP * remove ssh for ci/cd * FIX GPU cicd * ADD concurrency to all git ci/cd jobs * concurrency check * devinterp tests don't need our cool machines * CHG caching for cicd on gpu * skip two failing ci/cd tests * CHG cicd caching * CHG pytest skip markings * FIX hessian test * FIX failing CI/CD * CHG CI/CD to use correct DISK_PATH * CHG -n 2 to prevent DDoS * CHG workflow yamls caching, n back to auto * FIX formatting
* CHG use ruff isntead of black * FIX CI/CD * CHG .vscode settings and suggested extensions * delete the empty ipynb files * Update Makefile Co-authored-by: William Snell <59493198+williamsnell@users.noreply.github.com> * FIX aiohttp from merge conflict * CHG don't lint on commit, don't lint projects folder * don't lint/ format projects folder * FIX formatting ;) * WIP pre-commit, CI/CD, config, makefile * WIP add linting check (that will fail) * CHG disable linting autofix * FIX linting & formatting * WIP CI/CD * FIX makefile setup old precommit hooks, attempt 2 * WIP CICD * WIP CICD * FIX CICD * slow failing mala tests --------- Co-authored-by: William Snell <59493198+williamsnell@users.noreply.github.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
(Ported over from internal changes)